Olympic Data
- 1 Introduction
- 2 Summary of Dataset
- 3 Description of Data
- 4 BMI Data
- 4.1 BMI Boxplots of Athletes’ Height, Weight and Age
- 4.2 Boxplot of All Olympic Athletes BMI
- 4.3 Boxplot of BMI vs Sport (Summer)
- 4.4 Boxplot of BMI vs Sport (Winter)
- 4.5 Boxplot of BMI vs Top 10 Events
- 4.6 BMI of Medal Winning Athletes
- 4.7 US Olympic Team
- 4.8 T-Intervals 95% confidence
- 4.9 T-Test
- 4.10 Anova Test comparing BMI Averages of Athletes From Different Olympic Sporting Events
- 5 Geographical Data
- 6 GDP Data
- 7 Name Data
- 8 Trends Over Time
- 9 Age Data
- 10 Discussion & Conclusion
- 10.1 Summary of the BMI of olympic athletes
- 10.2 Summary of the geographical data of olympic athletes
- 10.3 Summary of the effect of a country’s GDP and Population on the number of medals won by that country’s athletes
- 10.4 Summary of the most common names among olympic athletes
- 10.5 Summary of changes over time in weights and heights of olympic athletes
- 10.6 Summary of changes over time in ages of olympic athletes
1 Introduction
Team 010100 consists of the following members: Obumneke Amadi, Izzy Illari, Lucia Illari, Omar Qusous, and Lydia Teinfalt. You may find our work on GitHub.
With the 2020 Olympics beginning this July in Tokyo, we felt that a relevant discussion to have would be: What makes an Olympian? What can we say about Olympians? Have there been any general trends amongst Olympians? What does the Olympic population look like? These questions are all suited to exploratory data analysis (EDA), so with them in mind we looked for readily available data on Olympians to analyze. Eventually our question morphed into the following: are there any specific characteristics (i.e. age, weight, height, BMI, country of origin) that could be used to describe an Olympian in general?
We found a dataset called 120 years of Olympic history: athletes and results on Kaggle: https://www.kaggle.com/heesoo37/120-years-of-olympic-history-athletes-and-results. This historical dataset covers all Olympic Games from Athens 1896 to Rio 2016 and was scraped from https://www.sports-reference.com/. The data was compiled by a group of Olympic historians and statisticians, all of whom are members of the International Society of Olympic Historians (ISOH) and have been working on this project since the late 1990s.
The report is organized as follows:
- Summary of Dataset
- Description of Data
- What are the body types of Olympic Athletes? (BMI Data)
- Where do Olympians come from? (Geographical Data)
- Does a country’s GDP and Population affect the number of medals that its athletes win? (GDP Data)
- Are there any common names among Olympians? (Name Data)
- Have there been any body (weight and height) trends among Olympians over the years? (Trends Over Time)
- Are Olympic Athletes a certain age? (Age Data)
- Discussion And Conclusion
2 Summary of Dataset
The data looks like the following:
'data.frame': 271116 obs. of 15 variables:
$ ID : int 1 2 3 4 5 5 5 5 5 5 ...
$ Name : Factor w/ 134732 levels " Gabrielle Marie \"Gabby\" Adcock (White-)",..: 8 9 44318 29412 21469 21469 21469 21469 21469 21469 ...
$ Sex : Factor w/ 2 levels "F","M": 2 2 2 2 1 1 1 1 1 1 ...
$ Age : int 24 23 24 34 21 21 25 25 27 27 ...
$ Height: int 180 170 NA NA 185 185 185 185 185 185 ...
$ Weight: num 80 60 NA NA 82 82 82 82 82 82 ...
$ Team : Factor w/ 1184 levels "30. Februar",..: 199 199 273 278 705 705 705 705 705 705 ...
$ NOC : Factor w/ 230 levels "AFG","AHO","ALB",..: 42 42 56 56 146 146 146 146 146 146 ...
$ Games : Factor w/ 51 levels "1896 Summer",..: 38 49 7 2 37 37 39 39 40 40 ...
$ Year : int 1992 2012 1920 1900 1988 1988 1992 1992 1994 1994 ...
$ Season: Factor w/ 2 levels "Summer","Winter": 1 1 1 1 2 2 2 2 2 2 ...
$ City : Factor w/ 42 levels "Albertville",..: 6 18 3 27 9 9 1 1 17 17 ...
$ Sport : Factor w/ 66 levels "Aeronautics",..: 9 33 25 62 54 54 54 54 54 54 ...
$ Event : Factor w/ 765 levels "Aeronautics Mixed Aeronautics",..: 160 398 349 710 623 619 623 619 623 619 ...
$ Medal : Factor w/ 3 levels "Bronze","Gold",..: NA NA NA 2 NA NA NA NA NA NA ...
The athlete events data has 15 columns and 271116 rows, for a total of 4066740 individual data points. In athlete_events, each row corresponds to an individual athlete competing in an individual Olympic event. The variables are the following:
- ID: Unique number for each athlete
- Name: Athlete’s name
- Sex: M or F
- Age: Integer
- Height: centimeters
- Weight: kilograms
- Team: Team name
- NOC: National Olympic Committee 3-letter code
- Games: Year and season
- Year: Integer
- Season: Summer or Winter
- City: Host city
- Sport
- Event
- Medal: Gold, Silver, Bronze, or NA
To prepare our data for EDA we dropped the Olympic event: Art Sculpting. NAs were also removed.
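The preparation step can be sketched in base R as follows. This is only a minimal illustration under stated assumptions: the toy data frame and its values are invented, and the exact label used to match the art event is hypothetical (the labels in the Kaggle file may differ). The name `athlete_events_noNA` is the one that appears in the ANOVA calls later in this report.

```r
# Toy stand-in for athlete_events (all values invented for illustration)
athlete_events <- data.frame(
  ID     = c(1, 2, 3, 4, 5),
  Height = c(180, 170, NA, 165, 185),
  Weight = c(80, 60, NA, 55, 82),
  Event  = c("Basketball", "Judo", "Football", "Art Sculpting", "Speed Skating"),
  Medal  = c("Gold", "Silver", NA, "Bronze", "Bronze"),
  stringsAsFactors = FALSE
)

# Drop the art event (the label "Art Sculpting" is an assumption)
no_art <- athlete_events[athlete_events$Event != "Art Sculpting", ]

# Drop every row that contains at least one NA in any column
athlete_events_noNA <- na.omit(no_art)

nrow(athlete_events_noNA)
```

Note that `na.omit()` applied to the whole data frame also drops rows where `Medal` is NA, so a frame cleaned this way keeps only medal winners with complete measurements. Whether the report's pipeline removed NAs column by column or across all columns at once is not stated.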
3 Description of Data
We can look at the top events by the number of athletes participating in them. This can be shown in a table or in a bar chart.
| | Sport | freq |
|---|---|---|
| 4 | Athletics | 3648 |
| 44 | Swimming | 2486 |
| 33 | Rowing | 2104 |
| 26 | Ice Hockey | 1301 |
| 25 | Hockey | 1168 |
| 23 | Gymnastics | 1161 |
| 18 | Fencing | 1109 |
| 20 | Football | 1084 |
| 12 | Canoeing | 1041 |
| 7 | Basketball | 1000 |
| 55 | Wrestling | 967 |
| 52 | Volleyball | 958 |
| 24 | Handball | 937 |
| 15 | Cycling | 845 |
| 53 | Water Polo | 764 |
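A frequency table like the one above can be produced with `table()` and `sort()`. The following is a minimal sketch on an invented vector of sports, not the report's actual code; the object names `sports`, `counts`, and `top` are hypothetical.

```r
# Toy vector with one entry per athlete (invented values)
sports <- c("Athletics", "Swimming", "Athletics", "Rowing", "Swimming", "Athletics")

# Count athletes per sport and sort from most to least frequent
counts <- sort(table(sports), decreasing = TRUE)

# Convert to a two-column data frame similar to the table above
top <- data.frame(Sport = names(counts), freq = as.integer(counts))
top
```

The same `counts` object could be passed to `barplot()` to obtain the bar chart version.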
4 BMI Data
Body Mass Index (BMI) is used to group individuals into weight categories that may indicate a risk of health problems. Olympic athletes represent top physical fitness, so we wanted to see whether Olympians have a healthy weight according to the CDC.
BMI is calculated as an athlete’s weight in kilograms divided by the square of their height in meters. The dataset stores height in centimeters, so heights were converted to meters before applying the formula. The resulting number can be used to classify athletes older than 20 into one of these groups: underweight, normal or healthy weight, overweight, and obese.
| Weight.Category | BMI |
|---|---|
| Underweight | Less than 18.5 |
| Healthy Weight | Between 18.5 and 24.9 |
| Overweight | Between 25 and 29.9 |
| Obese | 30 or greater |
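The BMI calculation and the classification into the categories above can be sketched as follows. The heights and weights are invented, and the variable names are hypothetical. The cut points are the category boundaries from the table, with half-open intervals so that values such as 24.95 do not fall into a gap between "24.9" and "25".

```r
# Invented example measurements
height_cm <- c(180, 170, 185, 160, 175)
weight_kg <- c(80, 50, 82, 80, 85)

# BMI = weight (kg) / height (m)^2; heights are stored in cm, so divide by 100
bmi <- weight_kg / (height_cm / 100)^2

# right = FALSE gives intervals [a, b): exactly 25 is overweight, exactly 30 is obese
category <- cut(
  bmi,
  breaks = c(-Inf, 18.5, 25, 30, Inf),
  labels = c("Underweight", "Healthy Weight", "Overweight", "Obese"),
  right  = FALSE
)

data.frame(height_cm, weight_kg, bmi = round(bmi, 1), category)
```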
4.1 BMI Boxplots of Athletes’ Height, Weight and Age
The chart comparing height across weight categories shows that the average heights of female and male athletes differ in every category. Whether the Olympians are underweight or obese, the male athletes are on average taller than the female athletes. The healthy and overweight categories contain more outliers than the underweight and obese categories.
The weight chart shows that average weight increases from underweight to obese, as we would expect. There is very little variability in weight for underweight Olympians and much greater variability for obese athletes. As with height, the male athletes on average weigh more than the female athletes.
Average ages are close irrespective of weight category or gender. The number of outliers in the age data for healthy and overweight Olympians is notable.
4.2 Boxplot of All Olympic Athletes BMI
The histogram of relative frequency versus BMI for female and male Olympians clearly shows that not all athletes have a healthy weight according to the CDC.
4.3 Boxplot of BMI vs Sport (Summer)
We wondered about the differences between Olympians who competed in summer versus winter events.
In the summer events, the boxplots show that athletes in the overweight category competed in basketball, boxing, football, ice hockey, rugby, shooting, tug-of-war, weightlifting, and wrestling. There was great variability in BMI for weightlifters, and those at the top of the range were considered obese. In the underweight category were female rhythmic gymnasts and gymnasts.
4.4 Boxplot of BMI vs Sport (Winter)
In the winter events, the boxplots show that male athletes in the overweight category competed in alpine skiing, bobsleigh, curling, freestyle skiing, ice hockey, luge, and snowboarding. The female figure skaters were in the underweight category. Although the Winter Olympics have relatively few events, athletes in about half of them fall between the upper end of healthy weight and overweight, so it seems that more Winter Olympians sit in that range. This may reflect the nature of winter events, which require athletes to be bulkier and bigger in order to be competitive.
4.5 Boxplot of BMI vs Top 10 Events
The boxplot of BMI data for the top 10 events with the most athletes is easier to read and confirms that for events such as basketball, handball, ice hockey, water polo, and wrestling the male athletes are considered overweight by CDC standards. Gymnastics sits at the lower end of the weight categorization.
4.6 BMI of Medal Winning Athletes
We looked at sporting events in order to learn about summer versus winter athletes and found that there are more overweight winter athletes. The events played a strong role in determining the BMI classification of the athletes. We also wanted to find out whether medal winners share common characteristics. Did Gold/Silver/Bronze medalists have more in common with each other irrespective of the events they competed in?
The boxplot and histograms of BMI data for Gold/Silver/Bronze medal winners looked approximately normal.
We created a scatterplot of height and weight of athletes that competed in the top 10 events to discern groupings based on event but did not find any conclusive visual evidence.
4.7 US Olympic Team
We have looked at the data from the perspective of season, events and medals. Let’s focus on US Olympic Team data in table format to see if we can draw any conclusions.
Gymnasts on the US Olympic team have the lowest average BMI in the table (21.9), within the healthy weight range. Wrestlers are the second shortest group but share the highest average BMI (25.2) with ice hockey players. Consistent with the analysis in previous sections, both US wrestlers and US ice hockey players are classified as overweight.
| Sport | Avg Height (m) | Avg Weight (kg) | BMI |
|---|---|---|---|
| Gymnastics | 1.64 | 59.4 | 21.9 |
| Wrestling | 1.73 | 76.8 | 25.2 |
| Hockey | 1.74 | 69.3 | 22.9 |
| Football | 1.75 | 70.8 | 22.9 |
| Athletics | 1.78 | 71.8 | 22.5 |
| Cycling | 1.78 | 73.2 | 23.1 |
| Fencing | 1.78 | 72.3 | 22.7 |
| Ice Hockey | 1.79 | 81.2 | 25.2 |
| Canoeing | 1.80 | 78.0 | 24.1 |
| Handball | 1.83 | 81.5 | 24.0 |
| Swimming | 1.83 | 76.2 | 22.5 |
| Rowing | 1.85 | 81.2 | 23.6 |
| Water Polo | 1.86 | 85.8 | 24.8 |
| Volleyball | 1.87 | 79.8 | 22.7 |
| Basketball | 1.92 | 86.8 | 23.3 |
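A per-sport summary of this kind can be computed with `aggregate()`. The sketch below uses invented rows and hypothetical object names, so it illustrates the technique rather than the report's own code. It also shows one possible reason why the BMI column above need not equal average weight divided by squared average height: the mean of individual BMIs is generally not the BMI of the mean measurements. Which of the two the report used is an assumption here.

```r
# Invented athletes, two per sport
team <- data.frame(
  Sport  = c("Gymnastics", "Gymnastics", "Basketball", "Basketball"),
  Height = c(160, 168, 190, 194),   # cm
  Weight = c(55, 63, 85, 89)        # kg
)
team$BMI <- team$Weight / (team$Height / 100)^2

# Mean height, weight and BMI per sport
summary_tbl <- aggregate(cbind(Height, Weight, BMI) ~ Sport, data = team, FUN = mean)

# BMI recomputed from the averaged height and weight, for comparison
summary_tbl$BMI_of_means <- summary_tbl$Weight / (summary_tbl$Height / 100)^2

# Sort from shortest to tallest, as in the table above
summary_tbl[order(summary_tbl$Height), ]
```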
4.8 T-Intervals 95% confidence
| Variable | Gender | 95% confidence interval |
|---|---|---|
| BMI | Sex = M | [23.982, 24.064] |
| BMI | Sex = F | [21.802, 21.906] |
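A 95% t-interval for a mean has the form \(\bar{x} \pm t_{0.975,\,n-1}\, s/\sqrt{n}\). The sketch below computes it both with `t.test()` and by hand on simulated BMI values. The data are randomly generated and the names are hypothetical, so the numbers will not match the table above.

```r
set.seed(1)

# Simulated BMI values standing in for one group of athletes
bmi_m <- rnorm(200, mean = 24, sd = 3)

# Interval as reported by t.test()
ci <- t.test(bmi_m, conf.level = 0.95)$conf.int

# The same interval computed from the formula
n          <- length(bmi_m)
half_width <- qt(0.975, df = n - 1) * sd(bmi_m) / sqrt(n)
manual     <- mean(bmi_m) + c(-1, 1) * half_width

ci
manual
```

The narrow intervals in the table are what one would expect from this formula when \(n\) is large, since the half-width shrinks with \(\sqrt{n}\).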
4.9 T-Test
The p-value of 2.2e-16 is small enough to be treated as 0. Using a two-sided test, we can therefore reject the null hypothesis that the mean BMI is equal for male and female Olympians. The plots indicate that male Olympians have a greater BMI than female Olympians. The mean BMI is 21.9 for female athletes and 24.0 for male athletes.
| Variable | Gender | Average |
|---|---|---|
| BMI | Sex = F | 21.9 |
| BMI | Sex = M | 24.0 |
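A two-sided, two-sample t-test of this kind can be run with the formula interface of `t.test()`. The sketch below uses simulated data with group means chosen near the reported averages. The standard deviations and sample sizes are invented, and `t.test()` defaults to the Welch (unequal variance) version, which may or may not be what the report used.

```r
set.seed(2)

# Simulated BMI values for female and male athletes
bmi <- data.frame(
  Sex = rep(c("F", "M"), each = 100),
  BMI = c(rnorm(100, mean = 21.9, sd = 2.5),
          rnorm(100, mean = 24.0, sd = 2.5))
)

# H0: the mean BMI is equal in both groups; H1: the means differ
result <- t.test(BMI ~ Sex, data = bmi, alternative = "two.sided")

result$p.value    # compare against the chosen significance level
result$estimate   # sample mean of each group
```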
4.10 Anova Test comparing BMI Averages of Athletes From Different Olympic Sporting Events
Df Sum Sq Mean Sq F value Pr(>F)
Sport 14 16916 1208 205 <0.0000000000000002 ***
Residuals 18632 109780 6
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The small p-value indicates that there are significant differences in the average BMI of athletes across sports, which is consistent with the plots in Sections 4.3 and 4.4.
Many Olympic athletes have more muscular builds than the average person. Their BMI may technically be categorized as overweight, but this is not necessarily unhealthy. Many BMI charts point out that BMI can improperly categorize athletes, and our analysis is consistent with this.
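To make the ANOVA table in Section 4.10 easier to read: each mean square is the sum of squares divided by its degrees of freedom (16916 / 14 ≈ 1208 and 109780 / 18632 ≈ 5.9), and the F value is their ratio (about 205). The sketch below reproduces that relationship on simulated data with three invented sports; it is an illustration, not the report's code.

```r
set.seed(3)

# Simulated BMI values for three sports (means and spreads are invented)
toy <- data.frame(
  Sport = factor(rep(c("Gymnastics", "Swimming", "Wrestling"), each = 50)),
  BMI   = c(rnorm(50, 21, 2), rnorm(50, 22.5, 2), rnorm(50, 25, 2))
)

fit <- aov(BMI ~ Sport, data = toy)
tab <- summary(fit)[[1]]   # ANOVA table as a data frame
tab

# F = (between-group mean square) / (residual mean square)
f_manual <- tab[["Mean Sq"]][1] / tab[["Mean Sq"]][2]
f_manual
```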
5 Geographical Data
6 GDP Data
7 Name Data
8 Trends Over Time
8.1 Trends in Specific Sports
We wanted to see the changes in Olympians for the top 10 events (in terms of how many Olympians were recorded for the sport) over the years. Below are the top 10 events:
| | sport.names | sport.counts |
|---|---|---|
| Var1.55 | Swimming | 2486 |
| Var1.44 | Rowing | 2104 |
| Var1.31 | Ice Hockey | 1301 |
| Var1.30 | Hockey | 1168 |
| Var1.28 | Gymnastics | 1161 |
| Var1.23 | Fencing | 1109 |
| Var1.25 | Football | 1084 |
| Var1.15 | Canoeing | 1041 |
| Var1.9 | Basketball | 1000 |
| Var1.66 | Wrestling | 967 |
It seems that in general events like rowing or basketball have been trending towards larger and larger athletes (in terms of both weight and height), whereas events like gymnastics skew in the opposite direction, favoring smaller and smaller athletes. The wrestling data is a bit misleading: we have taken the averages per Olympic Games year, and if we were to look at the wrestling data by itself we would see clear categories for weight and height. This is due to the nature of the sport and its weight classes, and it is why the men’s wrestling data looks like a constant straight line. The weight classes at the Olympics are the following:
- Freestyle weight classes (12): Men’s 57kg, Men’s 65kg, Men’s 74kg, Men’s 86kg, Men’s 97kg, Men’s 125kg, Women’s 50kg, Women’s 53kg, Women’s 57kg, Women’s 62kg, Women’s 68kg, Women’s 76kg
- Greco-Roman weight classes (6): Men’s 60kg, Men’s 67kg, Men’s 77kg, Men’s 87kg, Men’s 97kg, Men’s 130kg
8.2 Changes in Weights and Heights Over The Decades
We can also use this data to look at trends across “decades”. We can overlay the data for the decades on top of each other in a histogram, or we can use boxplots where the decade has been converted into a factor.
Now that we’ve created a new column Decade in the dataframe, we can make histograms or boxplots of the Weight and Height and group by Decade.
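One way to derive a Decade column from Year is to round each year down to the nearest multiple of ten and store the result as an ordered factor. The rule below is an assumption about how the grouping was done, not the report's confirmed code. It does yield 13 levels for Games from 1896 to 2016 (1890 through 2010), which agrees with the 12 degrees of freedom for Decade in the ANOVA tables of Section 8.3.

```r
# A few Games years taken from the dataset summary
years <- c(1896, 1920, 1988, 1992, 2012, 2016)

# Round down to the start of the decade, e.g. 1988 -> 1980
decade <- floor(years / 10) * 10

# Ordered factor so that plots and tests treat decades as ordered groups
Decade <- factor(decade, levels = sort(unique(decade)), ordered = TRUE)

data.frame(Year = years, Decade = Decade)
```

With such a column in place, a call like `boxplot(Weight ~ Decade, data = athlete_events_noNA)` would draw one box per decade.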
8.3 ANOVA Testing on the Trends Over the Decades
We have the boxplots, but we can use ANOVA to test the difference between the means of the weight and height data over the decades. For ANOVA testing we have:
- Null hypothesis: the means of the different groups are the same
- Alternative hypothesis: At least one sample mean is not equal to the others.
In one-way ANOVA test, a significant \(p\)-value indicates that some of the group means are different, but we don’t know which pairs of groups are different. We use a significance level of \(\alpha = 0.05\).
Here is the ANOVA test for the weight data over the decades
Df Sum Sq Mean Sq F value Pr(>F)
Decade 12 4547 379 1.68 0.063 .
Residuals 30168 6790473 225
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We can see that for aov(Weight ~ Decade, data=athlete_events_noNA), we have \(p\) = 6.341e-02. This \(p\)-value is above \(\alpha = 0.05\) and so is not significant: we cannot conclude that the mean weight of the athletes differs between the decades.
Here is the ANOVA test for the height data over the decades
Df Sum Sq Mean Sq F value Pr(>F)
Decade 12 10066 839 7.05 0.00000000000058 ***
Residuals 30168 3591551 119
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We can see that for aov(Height ~ Decade, data=athlete_events_noNA), we have \(p\) = 5.807e-13. This \(p\)-value is far below \(\alpha = 0.05\), which means that the mean height of the athletes differs significantly between at least some of the decades.
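Since a significant one-way ANOVA does not say which pairs of groups differ, a post-hoc comparison is a common follow-up. The report does not state that one was run, so the sketch below introduces Tukey's honest significant difference test purely as an illustration of that next step, using simulated heights for three invented decades.

```r
set.seed(4)

# Simulated heights (cm) for three decades; all values are invented
toy <- data.frame(
  Decade = factor(rep(c("1960", "1980", "2000"), each = 40)),
  Height = c(rnorm(40, 174, 6), rnorm(40, 176, 6), rnorm(40, 180, 6))
)

fit <- aov(Height ~ Decade, data = toy)

# Pairwise differences in means with family-wise 95% confidence intervals
tukey <- TukeyHSD(fit, "Decade", conf.level = 0.95)
tukey$Decade   # columns: diff, lwr, upr, p adj
```

Pairs whose interval excludes zero (equivalently, whose adjusted \(p\)-value is below 0.05) are the ones that differ at the chosen level.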
9 Age Data
| Medal | mean |
|---|---|
| Gold | 25.9 |
| Silver | 26.0 |
| Bronze | 25.9 |
It appears that the mean age at which an Olympic medalist wins a medal is around 26 years old. We can also look at the ages of medal-winning athletes separated by the Summer and Winter Games.
From the plots we can see that between WW1 and WW2 the average age of medalists decreased, but after WW2 the average age temporarily rose. The age then decreases until 1980 and rises again afterwards, and it seems to plateau in the 2010s.
It seems that there are fewer peaks and dips in the Winter Games data than in the Summer Games data, and the Winter athletes seem to have a smaller variance in age. We can look at the Summer and Winter Games together.
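Group means like these can be computed with `tapply()`, either by medal alone or by medal and season together. The rows below are invented and the object names are hypothetical, so this is only a sketch of the technique.

```r
# Invented medalists
medalists <- data.frame(
  Medal  = c("Gold", "Gold", "Silver", "Silver", "Bronze", "Bronze"),
  Season = c("Summer", "Winter", "Summer", "Winter", "Summer", "Winter"),
  Age    = c(24, 28, 25, 27, 23, 29)
)

# Fix the display order of the medals
medalists$Medal <- factor(medalists$Medal, levels = c("Gold", "Silver", "Bronze"))

# Mean age per medal
mean_by_medal <- tapply(medalists$Age, medalists$Medal, mean)

# Mean age per medal and season (rows: medal, columns: season)
mean_by_medal_season <- tapply(
  medalists$Age,
  list(medalists$Medal, medalists$Season),
  mean
)

mean_by_medal
mean_by_medal_season
```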
It appears that after the 1950s the athletes at the Winter Games, on average, are older. Both Summer and Winter Games experience an upward trend in ages after the 1980s.
We can also look at the breakdown of ages between Season and Gender.
When we look at the medal-winning athletes during the Summer Games by gender, we see that in general men win medals at older ages than women do. Just as with the Summer Games, in the Winter Games we see that, on average, male athletes tend to be older than female athletes.
10 Discussion & Conclusion
10.1 Summary of the BMI of olympic athletes
10.2 Summary of the geographical data of olympic athletes
10.3 Summary of the effect of a country’s GDP and Population on the number of medals won by that country’s athletes
10.4 Summary of the most common names among olympic athletes
10.5 Summary of changes over time in weights and heights of olympic athletes
To see the changes in the athletes’ weights and heights over time we narrowed our analysis to just the “top 10” events. A “top” event was defined by the total number of athletes that had participated in it, which meant that although we reduced the amount of data analyzed we still had a large number of athletes to look at. The category “Athletics”, although the most “popular” in terms of having the single highest number of participants, was excluded due to the remarkable diversity of subcategories it includes (track and field, road running, cross country running, and race walking). This diversity made it difficult to determine what the average athlete in that event looks like, unlike the other events, which are single sports such as basketball or swimming.
It seems that over time the weights and heights of Olympic athletes have been moving towards extremes. We can see how, for example, the Olympians in gymnastics have become much lighter and much shorter, while the opposite is true for events like basketball, rowing, and swimming. Being an Olympic athlete very much means that your body type is an outlier, and the kind of outlier it is (very light or heavy, very short or tall) is influenced by the type of event that the athlete competes in. The number of outliers in the weights and heights of the athletes increases as we go through the decades, and the most recent decade of Olympic Games shows more outliers in both weight and height data than any other decade.
10.6 Summary of changes over time in ages of olympic athletes
While the weights and heights of the athletes seem to be affected by the event in which they participate (or, rather, athletes with extreme body types perhaps go to the Olympics more than individuals with “average” body types), the ages of the Olympic athletes seem to be affected by global events, which makes sense considering that the Olympics are, in and of themselves, global events. The overall average age of all Olympians, even amongst the medal winners, is roughly 26 years old.
During the two World Wars the average ages of the athletes are higher than in other periods, presumably because the younger athletes were fighting in the wars. A good three quarters of the Olympic athletes have been men, and so it makes sense to see this shift in the mean age when we suddenly lose a good majority of the young male athletes to war. After WW2 we see a decrease in the average age of Olympians, indicating that we have a new group of young athletes participating in the games. Since roughly the 1980s, there has been a slow increase in the average ages of Olympians. On average female athletes have fairly consistently been younger than their male counterparts, and the Olympians participating in the Winter games are younger than the Olympians participating in the Summer games.